Abstract. The problem of computing the edit-distance between a string
and a finite automaton arises in a variety of applications in computational 
biology, text processing, and speech recognition. This paper presents 
linear-space algorithms for computing the edit-distance between a string 
and an arbitrary weighted automaton over the tropical semiring, or an 
unambiguous weighted automaton over an arbitrary semiring. It also 
gives an efficient linear-space algorithm for finding an optimal alignment 
of a string and such a weighted automaton. 

1 Introduction 

The problem of computing the edit-distance between a string and a finite au- 
tomaton arises in a variety of applications in computational biology, text pro- 
cessing, and speech recognition [8,10,18,21,14]. This may be to compute the 
edit-distance between a protein sequence and a family of protein sequences com- 
pactly represented by a finite automaton [8, 10, 21], or to compute the error rate 
of a word lattice output by a speech recognizer with respect to a reference
transcription [14]. A word lattice is a weighted automaton, thus this further mo- 
tivates the need for computing the edit-distance between a string and a weighted 
automaton. In all these cases, an optimal alignment is also typically sought. In 
computational biology, this may be to infer the function and various properties 
of the original protein sequence from the one it is best aligned with. In speech 
recognition, this determines the best transcription hypothesis contained in the 
lattice. 

This paper presents linear-space algorithms for computing the edit-distance 
between a string and an arbitrary weighted automaton over the tropical semiring, 
or an unambiguous weighted automaton over an arbitrary semiring. It also gives 
an efficient linear-space algorithm for finding an optimal alignment of a string 
and such a weighted automaton. Our linear-space algorithms are obtained by
using the same generic shortest-distance algorithm but by carefully defining
different queue disciplines. More precisely, our meta-queue disciplines are derived
in the same way from an underlying queue discipline defined over states with the
same level.



The connection between the edit-distance and the shortest distance in a 
directed graph was made very early on (see [10, 4-6] for a survey of string algo- 
rithms). This paper revisits some of these algorithms and shows that they are all 
special instances of the same generic shortest-distance algorithm using different 
queue disciplines. We also show that the linear-space algorithms all correspond 
to using the same meta-queue discipline using different underlying queues. Our 
approach thus provides a better understanding of these classical algorithms and
makes it possible to easily generalize them, in particular to weighted automata.

The first algorithm to compute the edit-distance between a string $x$ and a
finite automaton $A$ as well as their alignment was due to Wagner [25] (see also
[26]). Its time complexity was in $O(|x| |A|_Q^2)$ and its space complexity in
$O(|A|_Q^2 |\Sigma| + |x| |A|_Q)$, where $\Sigma$ denotes the alphabet and $|A|_Q$ the
number of states of $A$. Sankoff and Kruskal [23] pointed out that a time and space
complexity of $O(|x||A|)$ can be achieved when the automaton $A$ is acyclic. Myers
and Miller [17] significantly improved on previous results. They showed that when
$A$ is acyclic or when it is a Thompson automaton, that is an automaton obtained
from a regular expression using Thompson's construction [24], the edit-distance
between $x$ and $A$ can be computed in $O(|x||A|)$ time and $O(|x| + |A|)$ space.
They also showed, using a technique due to Hirschberg [11], that the optimal
alignment between $x$ and $A$ can be obtained in $O(|x| + |A|)$ space, and in
$O(|x||A|)$ time if $A$ is acyclic, and in $O(|x||A| \log |x|)$ time when $A$ is a
Thompson automaton.

The remainder of the paper is organized as follows. Section 2 introduces the 
definition of semirings, and weighted automata and transducers. In Section 3, 
we give a formal definition of the edit-distance between a string and a finite 
automaton, or a weighted automaton. Section 4 presents our linear-space algo- 
rithms, including the proof of their space and time complexity and a discussion 
of an improvement of the time complexity for automata with some favorable 
graph structure property. 

2 Preliminaries 

This section gives the standard definition and specifies the notation used for 
weighted transducers and automata which we use in our computation of the 
edit-distance. 

Finite-state transducers are finite automata [20] in which each transition is 
augmented with an output label in addition to the familiar input label [2,9]. 
Output labels are concatenated along a path to form an output sequence and
similarly input labels define an input sequence. Weighted transducers are finite- 
state transducers in which each transition carries some weight in addition to the 
input and output labels [22, 12]. Similarly, weighted automata are finite automata 
in which each transition carries some weight in addition to the input label. A 
path from an initial state to a final state is called an accepting path. A weighted 
transducer or weighted automaton is said to be unambiguous if it admits no two 
accepting paths with the same input sequence. 



The weights are elements of a semiring $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$, that is a ring that
may lack negation [12]. Some familiar semirings are the tropical semiring
$(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ and the probability semiring
$(\mathbb{R}_+ \cup \{\infty\}, +, \times, 0, 1)$, where $\mathbb{R}_+$ denotes the set of
non-negative real numbers. In the following, we will only
consider weighted automata and transducers over the tropical semiring. However,
all the results of Section 4.2 hold for an unambiguous weighted automaton $A$ over
an arbitrary semiring.

The following gives a formal definition of weighted transducers. 

Definition 1. A weighted finite-state transducer $T$ over the tropical semiring
$(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ is an 8-tuple
$T = (\Sigma, \Delta, Q, I, F, E, \lambda, \rho)$ where $\Sigma$ is
the finite input alphabet of the transducer, $\Delta$ its finite output alphabet, $Q$ is a
finite set of states, $I \subseteq Q$ the set of initial states, $F \subseteq Q$ the set of final states,
$E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times (\mathbb{R}_+ \cup \{\infty\}) \times Q$
a finite set of transitions,
$\lambda : I \to \mathbb{R}_+ \cup \{\infty\}$ the initial weight function, and
$\rho : F \to \mathbb{R}_+ \cup \{\infty\}$ the final weight function.

We define the size of $T$ as $|T| = |T|_Q + |T|_E$ where $|T|_Q = |Q|$ is the number of
states and $|T|_E = |E|$ the number of transitions of $T$.

The weight of a path $\pi$ in $T$ is obtained by summing the weights of its
constituent transitions and is denoted by $w[\pi]$. The weight of a pair of input and
output strings $(x, y)$ is obtained by taking the minimum of the weights of the
paths labeled with $(x, y)$ from an initial state to a final state.

For a path $\pi$, we denote by $p[\pi]$ its origin state and by $n[\pi]$ its destination
state. We also denote by $P(I, x, y, F)$ the set of paths from the initial states $I$
to the final states $F$ labeled with input string $x$ and output string $y$. The weight
$T(x, y)$ associated by $T$ to a pair of strings $(x, y)$ is defined by:

$$T(x, y) = \min_{\pi \in P(I, x, y, F)} \lambda(p[\pi]) + w[\pi] + \rho[n[\pi]]. \tag{1}$$

Figure 1(a) shows an example of weighted transducer over the tropical semiring. 

Weighted automata can be defined as weighted transducers A with identical 
input and output labels, for any transition. Thus, only pairs of the form (x, x) 
can have a non-zero weight by A, which is why the weight associated by A to 
(x, x) is abusively denoted by A(x) and identified with the weight associated by 
A to x. Similarly, in the graph representation of weighted automata, the output 
(or input) label is omitted. Figure 1(b) shows an example. 

3 Edit-distance 

We first give the definition of the edit-distance between a string and a finite 
automaton. 

Let $\Sigma$ be a finite alphabet, and let $\Omega$ be defined by
$\Omega = (\Sigma \cup \{\epsilon\}) \times (\Sigma \cup \{\epsilon\}) - \{(\epsilon, \epsilon)\}$.
An element of $\Omega$ can be seen as a symbol edit operation:
$(a, \epsilon)$ is a deletion, $(\epsilon, a)$ an insertion, and $(a, b)$ with $a \neq b$ a substitution. We
will denote by $h$ the natural morphism between $\Omega^*$ and $\Sigma^* \times \Sigma^*$ defined by
$h((a_1, b_1) \cdots (a_n, b_n)) = (a_1 \cdots a_n, b_1 \cdots b_n)$. An alignment $\omega$ between two strings
$x$ and $y$ is an element of $\Omega^*$ such that $h(\omega) = (x, y)$.

Fig. 1. (a) Example of a weighted transducer $T$. (b) Example of a weighted automaton
$A$. $T(aab, bba) = A(aab) = \min(.1 + .2 + .6 + .8, .2 + .4 + .5 + .8)$. A bold circle indicates
an initial state and a double-circle a final state. The final weight $\rho[q]$ of a final state $q$
is indicated after the slash symbol representing $q$.

Let $c : \Omega \to \mathbb{R}_+$ be a function associating a non-negative cost to each edit
operation. The cost of an alignment $\omega = \omega_1 \cdots \omega_n$ is defined as
$c(\omega) = \sum_{i=1}^{n} c(\omega_i)$.

Definition 2. The edit-distance $d(x, y)$ of two strings $x$ and $y$ is the minimal
cost of a sequence of symbol insertions, deletions or substitutions transforming
one string into the other:

$$d(x, y) = \min_{h(\omega) = (x, y)} c(\omega). \tag{2}$$

When $c$ is the function defined by $c(a, a) = 0$ and $c(a, \epsilon) = c(\epsilon, a) = c(a, b) = 1$
for all $a, b$ in $\Sigma$ such that $a \neq b$, the edit-distance is also known as the
Levenshtein distance. The edit-distance $d(x, A)$ between a string $x$ and a finite
automaton $A$ can then be defined as

$$d(x, A) = \min_{y \in L(A)} d(x, y), \tag{3}$$

where $L(A)$ denotes the regular language accepted by $A$. The edit-distance $d(x, A)$
between a string $x$ and a weighted automaton $A$ over the tropical semiring is
defined as:

$$d(x, A) = \min_{y \in \Sigma^*} \left( A(y) + d(x, y) \right). \tag{4}$$
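Definitions (2) and (4) can be illustrated with a minimal Python sketch. It is not part of the algorithms of this paper: it uses the textbook prefix-by-prefix dynamic program for $d(x, y)$, and it replaces the weighted automaton $A$ by an explicit finite map from strings $y$ to weights $A(y)$, which is an assumption made only to keep the example small. The function names and the use of the empty string for $\epsilon$ are illustrative choices.

```python
def levenshtein_cost(a, b):
    """Levenshtein costs c(a, b); the empty string "" stands for epsilon."""
    return 0 if a == b else 1


def edit_distance(x, y, cost=levenshtein_cost):
    """d(x, y): minimal cost of an alignment between x and y (Definition 2)."""
    # row[j] holds d(x[:i], y[:j]) for the current prefix length i.
    row = [0]
    for j in range(1, len(y) + 1):
        row.append(row[j - 1] + cost("", y[j - 1]))        # insertions only
    for i in range(1, len(x) + 1):
        new_row = [row[0] + cost(x[i - 1], "")]            # deletions only
        for j in range(1, len(y) + 1):
            new_row.append(min(
                row[j] + cost(x[i - 1], ""),               # deletion
                new_row[j - 1] + cost("", y[j - 1]),       # insertion
                row[j - 1] + cost(x[i - 1], y[j - 1]),     # substitution or match
            ))
        row = new_row
    return row[len(y)]


def edit_distance_to_weighted_language(x, weights, cost=levenshtein_cost):
    """d(x, A) = min over y of A(y) + d(x, y), for a finite map y -> A(y)."""
    return min(w + edit_distance(x, y, cost) for y, w in weights.items())
```

For instance, `edit_distance_to_weighted_language("aba", {"ab": 0.5, "bbb": 0.0})` compares $0.5 + d(aba, ab)$ with $0 + d(aba, bbb)$ and keeps the smaller value.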



4 Algorithms 

In this section, we present linear-space algorithms both for computing the
edit-distance $d(x, A)$ between an arbitrary string $x$ and an automaton $A$, and an
optimal alignment between $x$ and $A$, that is an alignment $\omega$ such that
$c(\omega) = d(x, A)$.

We first briefly describe two general algorithms that we will use as subrou- 
tines. 



4.1 General algorithms 



Composition. The composition of two weighted transducers $T_1$ and $T_2$ over the
tropical semiring with matching input and output alphabets $\Sigma$ is a weighted
transducer denoted by $T_1 \circ T_2$ defined by:

$$(T_1 \circ T_2)(x, y) = \min_{z \in \Sigma^*} T_1(x, z) + T_2(z, y). \tag{5}$$

$T_1 \circ T_2$ can be computed from $T_1$ and $T_2$ using the composition algorithm for
weighted transducers of [19, 15]. States in the composition $T_1 \circ T_2$ are identified
with pairs of a state of $T_1$ and a state of $T_2$. In the absence of transitions with
$\epsilon$ inputs or outputs, the transitions of $T_1 \circ T_2$ are obtained as a result of the
following matching operation applied to the transitions of $T_1$ and $T_2$:

$$(q_1, a, b, w_1, q_1') \text{ and } (q_2, b, c, w_2, q_2') \rightarrow ((q_1, q_2), a, c, w_1 + w_2, (q_1', q_2')). \tag{6}$$

A state $(q_1, q_2)$ of $T_1 \circ T_2$ is initial (resp. final) iff $q_1$ and $q_2$ are initial (resp. final)
and, when it is initial (resp. final), its initial (resp. final) weight is the sum of the initial (resp.
final) weights of $q_1$ and $q_2$. In the worst case, all transitions of $T_1$ leaving a state
$q_1$ match all those of $T_2$ leaving state $q_2$, thus the space and time complexity of
composition is quadratic, that is $O(|T_1||T_2|)$.
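The matching operation (6) can be sketched in a few lines of Python for the $\epsilon$-free case. This is only an illustration under simplifying assumptions, not the composition algorithm of [19, 15]: the dictionary layout (`arcs`, `initial`, `final`) is hypothetical, and the sketch pairs all transitions without restricting itself to states reachable from an initial state.

```python
def compose(t1, t2):
    """Epsilon-free composition over the tropical semiring.

    Each transducer is a dict (an illustrative layout, not a library format):
      "arcs":    list of transitions (q, input, output, weight, q_next)
      "initial": dict mapping initial states to initial weights
      "final":   dict mapping final states to final weights
    """
    arcs = [
        ((q1, q2), a, c, w1 + w2, (r1, r2))
        for (q1, a, b1, w1, r1) in t1["arcs"]
        for (q2, b2, c, w2, r2) in t2["arcs"]
        if b1 == b2                      # output of T1 must match input of T2
    ]
    # A pair is initial (resp. final) iff both components are; weights add up.
    initial = {(q1, q2): w1 + w2
               for q1, w1 in t1["initial"].items()
               for q2, w2 in t2["initial"].items()}
    final = {(q1, q2): w1 + w2
             for q1, w1 in t1["final"].items()
             for q2, w2 in t2["final"].items()}
    return {"arcs": arcs, "initial": initial, "final": final}
```

The double loop over the two transition lists makes the quadratic worst case visible directly in the code.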



Shortest distance. Let A be a weighted automaton over the tropical semiring. 
The shortest distance from $p$ to $q$ is defined as

$$d[p, q] = \min_{\pi \in P(p, q)} w[\pi]. \tag{7}$$

It can be computed using the generic single-source shortest-distance algorithm 
of [13], a generalization of the classical shortest-distance algorithms. This generic 
shortest-distance algorithm works with an arbitrary queue discipline, that is the 
order according to which elements are extracted from a queue. We shall make use 
of this key property in our algorithms. The pseudocode of a simplified version 
of the generic algorithm for the tropical semiring is given in Figure 2. 

The complexity of the algorithm depends on the queue discipline selected for
$S$. Its general expression is

$$O\Big(|Q| + C(A) \max_{q \in Q} N(q)\, |E| + (C(I) + C(X)) \sum_{q \in Q} N(q)\Big), \tag{8}$$

where $N(q)$ denotes the number of times state $q$ is extracted from queue $S$, $C(X)$
the cost of extracting a state from $S$, $C(I)$ the cost of inserting a state in $S$, and
$C(A)$ the cost of an assignment.

With a shortest-first queue discipline implemented using a heap, the
algorithm coincides with Dijkstra's algorithm [7] and its complexity is
$O((|E| + |Q|) \log |Q|)$. For an acyclic automaton and with the topological order queue
discipline, the algorithm coincides with the standard linear-time ($O(|Q| + |E|)$)
shortest-distance algorithm [3].



Shortest-Distance(A, s)

    for each p ∈ Q do
        d[p] ← ∞
    d[s] ← 0
    S ← {s}
    while S ≠ ∅ do
        q ← Head(S)
        Dequeue(S)
        for each e ∈ E[q] do
            if d[q] + w[e] < d[n[e]] then
                d[n[e]] ← d[q] + w[e]
                if n[e] ∉ S then
                    Enqueue(S, n[e])

Fig. 2. Pseudocode of the generic shortest-distance algorithm.
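The role of the queue discipline can be made concrete with a small Python sketch of the generic algorithm for the tropical semiring. The queue interface (`update`, `pop`, truth value) and the two queue classes are assumptions introduced for this illustration; they are not taken from [13]. The shortest-first queue uses lazy deletion of stale heap entries instead of a decrease-key operation, a common substitute.

```python
import heapq
import math
from collections import deque


class FifoQueue:
    """First-in first-out discipline."""

    def __init__(self):
        self._items = deque()
        self._members = set()

    def update(self, state, dist):
        # Enqueue only if the state is not already waiting in the queue.
        if state not in self._members:
            self._members.add(state)
            self._items.append(state)

    def pop(self):
        state = self._items.popleft()
        self._members.discard(state)
        return state

    def __bool__(self):
        return bool(self._items)


class ShortestFirstQueue:
    """Shortest-first discipline (Dijkstra), with lazy deletion of stale entries."""

    def __init__(self):
        self._heap = []
        self._pending = {}           # state -> tentative distance currently queued

    def update(self, state, dist):
        self._pending[state] = dist
        heapq.heappush(self._heap, (dist, state))

    def pop(self):
        while True:
            dist, state = heapq.heappop(self._heap)
            if self._pending.get(state) == dist:   # skip outdated heap entries
                del self._pending[state]
                return state

    def __bool__(self):
        return bool(self._pending)


def shortest_distance(arcs, source, queue):
    """arcs maps a state to a list of (weight, next_state) pairs."""
    d = {source: 0.0}
    queue.update(source, 0.0)
    while queue:
        q = queue.pop()
        for weight, nxt in arcs.get(q, ()):
            if d[q] + weight < d.get(nxt, math.inf):   # relaxation step
                d[nxt] = d[q] + weight
                queue.update(nxt, d[nxt])
    return d
```

Both disciplines return the same distances on a graph with non-negative weights; they differ only in how many times a state may be extracted, which is the quantity $N(q)$ in (8).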



4.2 Edit-distance algorithms 

The edit cost function $c$ can be naturally represented by a one-state weighted
transducer over the tropical semiring $T_c$, or $T$ in the absence of ambiguity, with
each transition corresponding to an edit operation:
$T_c = (\Sigma, \Sigma, \{0\}, \{0\}, \{0\}, E_c, \bar{1}, \bar{1})$
where $E_c = \{(0, a, b, c(a, b), 0) \mid (a, b) \in \Omega\}$.

Lemma 1. Let $A$ be a weighted automaton over the tropical semiring and let $X$
be the finite automaton representing a string $x$. Then, the edit-distance between
$x$ and $A$ is the shortest distance from the initial state to a final state in the
weighted transducer $U = X \circ T \circ A$.

Proof. Each transition $e$ in $T$ corresponds to an edit operation $(i[e], o[e]) \in \Omega$, and
each path $\pi$ corresponds to an alignment $\omega$ between $i[\pi]$ and $o[\pi]$. The cost of
that alignment is, by definition of $T$, $c(\omega) = w[\pi]$. Thus, $T$ defines the function:

$$T(u, v) = \min_{\omega} \{c(\omega) : h(\omega) = (u, v)\} = d(u, v), \tag{9}$$

for any strings $u, v$ in $\Sigma^*$. Since $A$ is an automaton and $x$ is the only string
accepted by $X$, it follows from the definition of composition that $U(x, y) =
T(x, y) + A(y) = d(x, y) + A(y)$. The shortest distance from the initial state to
a final state in $U$ is then:

$$\min_{\pi \in P_U(I, F)} \lambda(p[\pi]) + w[\pi] + \rho[n[\pi]] = \min_{y \in \Sigma^*} U(x, y) \tag{10}$$

$$= \min_{y \in \Sigma^*} \left( d(x, y) + A(y) \right) = d(x, A), \tag{11}$$

that is the edit-distance between $x$ and $A$. □





Fig. 3. (a) Finite automaton $X$ representing the string $x = aba$. (b) Finite automaton
$A$. (c) Edit transducer $T$ over the alphabet $\{a, b\}$ where the cost of any insertion,
deletion and substitution is 1. (d) Weighted transducer $U = X \circ T \circ A$.

Figure 3 shows an example illustrating Lemma 1. Using the lateral strategy
of the 3-way composition algorithm of [1] or an ad hoc algorithm exploiting the
structure of $T$, $U = X \circ T \circ A$ can be computed in $O(|x||A|)$ time. The
shortest-distance algorithm presented in Section 4.1 can then be used to compute the
shortest distance from an initial state of $U$ to a final state and thus the edit
distance of $x$ and $A$. Let us point out that different queue disciplines in the
computation of that shortest distance lead to different algorithms and complexities.
In the next section, we shall give a queue discipline enabling us to achieve a
linear-space complexity.
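Lemma 1 can be checked on small inputs with the following Python sketch. It is an ad hoc illustration rather than the 3-way composition of [1]: the states $(i, j)$ of $U$ are generated on the fly from $x$ and $A$, and the shortest distance is computed with a shortest-first (Dijkstra) queue. The automaton layout (`arcs`, `initial`, `final`), the absence of $\epsilon$-transitions in $A$, and the use of `""` for $\epsilon$ are assumptions of the sketch. It stores a distance for every reachable state of $U$, so it is not linear-space.

```python
import heapq
import math


def levenshtein_cost(a, b):
    """Edit cost c(a, b); the empty string "" stands for epsilon."""
    return 0 if a == b else 1


def successors(state, x, arcs, cost):
    """Outgoing transitions of state (i, j) of U = X o T o A, built lazily."""
    i, j = state
    if i < len(x):
        yield (i + 1, j), cost(x[i], "")              # deletion of x[i]
    for label, weight, j_next in arcs.get(j, ()):
        yield (i, j_next), weight + cost("", label)   # insertion of label
        if i < len(x):
            # substitution (or match) of x[i] by label
            yield (i + 1, j_next), weight + cost(x[i], label)


def edit_distance_via_composition(x, arcs, initial, final, cost=levenshtein_cost):
    """d(x, A) as a shortest distance in U; arcs[j] lists (label, weight, j_next)."""
    dist = {(0, j): w for j, w in initial.items()}
    heap = [(w, s) for s, w in dist.items()]
    heapq.heapify(heap)
    while heap:
        d, s = heapq.heappop(heap)
        if d > dist[s]:
            continue                                   # stale heap entry
        for t, w in successors(s, x, arcs, cost):
            if d + w < dist.get(t, math.inf):
                dist[t] = d + w
                heapq.heappush(heap, (d + w, t))
    n = len(x)
    return min((dist[(n, f)] + rho for f, rho in final.items() if (n, f) in dist),
               default=math.inf)
```

With `arcs = {0: [("a", 0.0, 1)], 1: [("b", 0.0, 2)], 2: [("b", 0.5, 2)]}`, `initial = {0: 0.0}` and `final = {2: 0.0}`, the automaton accepts $ab^n$ with weight $0.5(n - 1)$, and the function can be compared against a direct evaluation of definition (4).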

4.3 Edit-distance computation in linear space 

Using the shortest-distance algorithm described in Section 4.1 leads to an
algorithm with space complexity linear in the size of $U$, i.e. in $O(|x||A|)$. However,
taking advantage of the topology of $U$, it is possible to design a queue discipline
that leads to a linear space complexity $O(|x| + |A|)$.

We assume that the finite automaton $X$ representing the string $x$ is
topologically sorted. A state $q$ in the composition $U = X \circ T \circ A$ can be identified with
a triplet $(i, 0, j)$ where $i$ is a state of $X$, $0$ the unique state of $T$, and $j$ a state of $A$.
Since $T$ has a unique state, we further simplify the notation by identifying each
state $q$ with a pair $(i, j)$. For a state $q = (i, j)$ of $U$, we will refer to $i$ as the level
of $q$. A key property of the levels is that there is a transition in $U$ from $q$ to $q'$
only if $\mathrm{level}(q') = \mathrm{level}(q)$ or $\mathrm{level}(q') = \mathrm{level}(q) + 1$. Indeed, a transition from $(i, j)$
to $(i', j')$ in $U$ must be obtained by taking a transition in $X$ (in that case $i' = i + 1$
since $X$ is topologically sorted) or by staying at the same state in $X$ and taking
an input-$\epsilon$ transition in $T$ (in that case $i' = i$).

From any queue discipline $\prec$ on the states of $U$, we can derive a new queue
discipline $\prec_l$ over $U$ defined for all $q, q'$ in $U$ as follows:

$$q \prec_l q' \text{ iff } (\mathrm{level}(q) < \mathrm{level}(q')) \text{ or } (\mathrm{level}(q) = \mathrm{level}(q') \text{ and } q \prec q'). \tag{12}$$

Proposition 1. Let $\prec$ be a queue discipline that requires at most $O(|V|)$ space
to maintain a queue over any set of states $V$. Then, the edit-distance between $x$
and $A$ can be computed in linear space, $O(|x| + |A|)$, using the queue discipline
$\prec_l$.

Proof. The benefit of that queue discipline is that when computing the shortest
distance to $q = (i, j)$ in $U$, only the shortest distances to the states in $U$ of level
$i$ and $i - 1$ need to be stored in memory. The shortest distances to the states of
level strictly less than $i - 1$ can be safely discarded. Thus, the space required to
store the shortest distances is in $O(|A|_Q)$.

Similarly, there is no need to store in memory the full transducer $U$.
Instead, we can keep in memory the last two levels active in the shortest-distance
algorithm. This is possible because the computation of the outgoing
transitions of a state with level $i$ only requires knowledge about the states with
level $i$ and $i + 1$. Therefore, the space used to store the active part of $U$ is
in $O(|A|_E + |A|_Q) = O(|A|)$. Thus, it follows that the space required to compute
the edit-distance of $x$ and $A$ is linear, that is in $O(|x| + |A|)$. □
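The level-by-level idea behind Proposition 1 can be sketched in Python as follows. This is a simplified illustration in the spirit of the derived queue discipline, not the exact implementation analysed in the proof: levels are processed one at a time, only the distances of the previous and current levels are kept, and a shortest-first queue is used inside each level. The automaton layout, the absence of $\epsilon$-transitions in $A$, and the use of `""` for $\epsilon$ are assumptions. Because stale heap entries are deleted lazily, the heap of one level may hold up to $O(|A|)$ entries rather than $O(|A|_Q)$.

```python
import heapq
import math


def levenshtein_cost(a, b):
    """Edit cost c(a, b); the empty string "" stands for epsilon."""
    return 0 if a == b else 1


def close_level(dist, arcs, cost):
    """Shortest-first relaxation of the insertion transitions inside one level."""
    heap = [(d, j) for j, d in dist.items()]
    heapq.heapify(heap)
    while heap:
        d, j = heapq.heappop(heap)
        if d > dist[j]:
            continue                                  # stale heap entry
        for label, weight, j_next in arcs.get(j, ()):
            nd = d + weight + cost("", label)         # insertion of label
            if nd < dist.get(j_next, math.inf):
                dist[j_next] = nd
                heapq.heappush(heap, (nd, j_next))
    return dist


def edit_distance_linear_space(x, arcs, initial, final, cost=levenshtein_cost):
    """d(x, A), keeping only two levels of U in memory at any time."""
    prev = close_level(dict(initial), arcs, cost)     # level 0
    for a in x:
        cur = {}                                      # next level, seeded from prev
        for j, d in prev.items():
            nd = d + cost(a, "")                      # deletion of a
            if nd < cur.get(j, math.inf):
                cur[j] = nd
            for label, weight, j_next in arcs.get(j, ()):
                nd = d + weight + cost(a, label)      # substitution or match
                if nd < cur.get(j_next, math.inf):
                    cur[j_next] = nd
        prev = close_level(cur, arcs, cost)           # the older level is dropped
    return min((prev[f] + rho for f, rho in final.items() if f in prev),
               default=math.inf)
```

The dictionaries `prev` and `cur` each hold at most one entry per state of $A$, which is the source of the $O(|A|_Q)$ bound on stored distances.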

The time complexity of the algorithm depends on the underlying queue
discipline $\prec$. A natural choice for $\prec$ is the shortest-first queue discipline, that
is the queue discipline used in Dijkstra's algorithm. This yields the following
corollary.

Corollary 1. The edit-distance between a string $x$ and an automaton $A$ can
be computed in time $O(|x||A| \log |A|_Q)$ and space $O(|x| + |A|)$ using the queue
discipline $\prec_l$.

Proof. A shortest-first queue is maintained for each level and contains at most
$|A|_Q$ states. The cost for the global queue of an insertion, $C(I)$, or an assignment,
$C(A)$, is in $O(\log |A|_Q)$ since it corresponds to inserting in or updating one of the
underlying level queues. Since $N(q) = 1$, the general expression of the complexity
(8) leads to an overall time complexity of $O(|x||A| \log |A|_Q)$ for the
shortest-distance algorithm. □

When the automaton $A$ is acyclic, the time complexity can be further
improved by using for $\prec$ the topological order queue discipline.

Corollary 2. If the automaton $A$ is acyclic, the edit-distance between $x$ and
$A$ can be computed in time $O(|x||A|)$ and space $O(|x| + |A|)$ using the queue
discipline $\prec_l$ with the topological order queue discipline for $\prec$.



Proof. Computing the topological order for $U$ would require $O(|U|)$ space.
Instead, we use the topological order on $A$, which can be computed in $O(|A|)$,
to define the underlying queue discipline. The order inferred by (12) is then a
topological order on $U$. □
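For the acyclic case, the idea of Corollary 2 can be sketched in Python by visiting the states of each level in a topological order of $A$, which removes the need for a heap. As before, this is an illustrative sketch under assumptions: the automaton layout, the explicit `states` list, the absence of $\epsilon$-transitions in $A$, and the use of `""` for $\epsilon$ are not taken from the paper. The topological order is computed once with Kahn's algorithm.

```python
import math
from collections import deque


def levenshtein_cost(a, b):
    """Edit cost c(a, b); the empty string "" stands for epsilon."""
    return 0 if a == b else 1


def topological_order(states, arcs):
    """Kahn's algorithm on the (assumed acyclic) automaton A."""
    indegree = {q: 0 for q in states}
    for q in states:
        for _, _, q_next in arcs.get(q, ()):
            indegree[q_next] += 1
    ready = deque(q for q in states if indegree[q] == 0)
    order = []
    while ready:
        q = ready.popleft()
        order.append(q)
        for _, _, q_next in arcs.get(q, ()):
            indegree[q_next] -= 1
            if indegree[q_next] == 0:
                ready.append(q_next)
    return order


def edit_distance_acyclic(x, states, arcs, initial, final, cost=levenshtein_cost):
    """d(x, A) for acyclic A: each state of a level is visited exactly once."""
    order = topological_order(states, arcs)

    def close_level(dist):
        # Insertion transitions stay inside a level and follow A's arcs, so a
        # single pass in topological order finalizes every distance.
        for j in order:
            if j in dist:
                for label, weight, j_next in arcs.get(j, ()):
                    nd = dist[j] + weight + cost("", label)
                    if nd < dist.get(j_next, math.inf):
                        dist[j_next] = nd
        return dist

    prev = close_level(dict(initial))
    for a in x:
        cur = {}
        for j, d in prev.items():
            nd = d + cost(a, "")                      # deletion of a
            if nd < cur.get(j, math.inf):
                cur[j] = nd
            for label, weight, j_next in arcs.get(j, ()):
                nd = d + weight + cost(a, label)      # substitution or match
                if nd < cur.get(j_next, math.inf):
                    cur[j_next] = nd
        prev = close_level(cur)
    return min((prev[f] + rho for f, rho in final.items() if f in prev),
               default=math.inf)
```

Each level costs one pass over the transitions of $A$, which gives the $O(|x||A|)$ total time stated in the corollary.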

Myers and Miller [17] showed that when $A$ is a Thompson automaton, the
time complexity can be reduced to $O(|x||A|)$ even when $A$ is not acyclic. This is
possible because of the following observation: in a weighted automaton over the
tropical semiring, there always exists a shortest path that is simple, that is with
no cycle, since weights are non-negative and removing a cycle cannot increase the
weight of a path.

In general, it is not clear how to take advantage of this observation. However, 
a Thompson automaton has additionally the following structural property: a 
loop-connectedness of one. The loop-connectedness of $A$ is $k$ if in any
depth-first search of $A$, a simple path goes through at most $k$ back edges. [17] showed
that this property, combined with the observation made previously, can be used 
to improve the time complexity of the algorithm. The results of [17] can be 
generalized as follows. 

Corollary 3. If the loop-connectedness of $A$ is $k$, then the edit-distance between
$x$ and $A$ can be computed in $O(|x||A|k)$ time and $O(|x| + |A|)$ space.

Proof. We first use a depth-first search of $A$, identify back edges, and mark them
as such. We then compute the topological order for $A$, ignoring these back edges.
Our underlying queue discipline $\prec$ is defined such that a state $q = (i, j)$ is ordered
first based on the number of times it has been enqueued and secondly based on
the order of $j$ in the topological order ignoring back edges. This underlying queue
can be implemented in $O(|A|_Q)$ space with constant time costs for the insertion,
extraction and updating operations. The order $\prec_l$ derived from $\prec$ is then not
topological for a transition $e$ iff $e$ was obtained by matching a back edge in $A$
and $\mathrm{level}(p[e]) = \mathrm{level}(n[e])$. When such a transition $e$ is visited, $n[e]$ is reinserted
in the queue.

When state $q$ is dequeued for the $l$-th time, the value of $d[q]$ is the weight of
the shortest path from the initial state to $q$ that goes through at most $l - 1$ back
edges. Thus, the inequality $N(q) \leq k + 1$ holds for all $q$ and, since the costs for
managing the queue, $C(I)$, $C(A)$, and $C(X)$, are constant, the time complexity of
the algorithm is in $O(|x||A|k)$. □

4.4 Optimal alignment computation in linear space 

The algorithm presented in the previous section can also be used to compute an 
optimal alignment by storing a back pointer at each state in U. However, this 
can increase the space complexity up to $O(|x||A|_Q)$. The use of back pointers
to compute the best alignment can be avoided by using a technique due to 
Hirschberg [11] (also used by [16, 17]). 

As pointed out in previous sections, an optimal alignment between x and A 
corresponds to a shortest path in U = X o T o A. We will say that a state q in U 
is a midpoint of an optimal alignment between x and A if q belongs to a shortest 
path in $U$ and is such that $\mathrm{level}(q) = \lfloor |x|/2 \rfloor$.



Lemma 2. Given a pair $(x, A)$, a midpoint of the optimal alignment between $x$
and $A$ can be computed in $O(|x| + |A|)$ space with a time complexity in $O(|x||A|)$
if $A$ is acyclic and in $O(|x||A| \log |A|_Q)$ otherwise.

Proof. Let us consider $U = X \circ T \circ A$. For a state $q$ in $U$, let $d[q]$ denote the
shortest distance from the initial state to $q$, and $d^R[q]$ the shortest distance
from $q$ to a final state. For a given state $q = (i, j)$ in $U$, $d[(i, j)] + d^R[(i, j)]$ is the
cost of the shortest path going through $(i, j)$. Thus, for any $i$, the edit-distance
between $x$ and $A$ is $d(x, A) = \min_j (d[(i, j)] + d^R[(i, j)])$.

For a fixed $i_0$, we can compute both $d[(i_0, j)]$ and $d^R[(i_0, j)]$ for all $j$ in
$O(|x||A| \log |A|_Q)$ time (or $O(|x||A|)$ time if $A$ is acyclic) and in linear space
$O(|x| + |A|)$ using the algorithm from the previous section forward and
backward and stopping at level $i_0$ in each case. Running the algorithm backward
(exchanging initial and final states and permuting the origin and destination of
every transition) can be seen as computing the edit-distance between $x^R$ and
$A^R$, the mirror images of $x$ and $A$.

Let us now set $i_0 = \lfloor |x|/2 \rfloor$ and $j_0 = \mathrm{argmin}_j (d[(i_0, j)] + d^R[(i_0, j)])$. It
then follows that $(i_0, j_0)$ is a midpoint of the optimal alignment. Hence, for a
pair $(x, A)$, the running-time complexity of determining the midpoint of the
alignment is in $O(|x||A|)$ if $A$ is acyclic and $O(|x||A| \log |A|_Q)$ otherwise. □
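The midpoint computation in the proof of Lemma 2 can be sketched in Python as a forward pass over the first half of $x$ and a backward pass over the mirror images of the second half of $x$ and of $A$. This is a hedged illustration, not the authors' implementation: the automaton layout, the helper names, the absence of $\epsilon$-transitions in $A$, and the use of `""` for $\epsilon$ are assumptions, and a shortest-first queue is used within each level.

```python
import heapq
import math


def levenshtein_cost(a, b):
    """Edit cost c(a, b); the empty string "" stands for epsilon."""
    return 0 if a == b else 1


def level_distances(symbols, arcs, start, cost):
    """Distances to the states of the last level after reading `symbols`,
    keeping one level in memory at a time."""
    def close_level(dist):
        heap = [(d, j) for j, d in dist.items()]
        heapq.heapify(heap)
        while heap:
            d, j = heapq.heappop(heap)
            if d > dist[j]:
                continue                                  # stale heap entry
            for label, weight, j_next in arcs.get(j, ()):
                nd = d + weight + cost("", label)         # insertion
                if nd < dist.get(j_next, math.inf):
                    dist[j_next] = nd
                    heapq.heappush(heap, (nd, j_next))
        return dist

    prev = close_level(dict(start))
    for a in symbols:
        cur = {}
        for j, d in prev.items():
            cur[j] = min(cur.get(j, math.inf), d + cost(a, ""))       # deletion
            for label, weight, j_next in arcs.get(j, ()):
                nd = d + weight + cost(a, label)          # substitution or match
                cur[j_next] = min(cur.get(j_next, math.inf), nd)
        prev = close_level(cur)
    return prev


def reverse_arcs(arcs):
    """Mirror image of A: origin and destination of every transition swapped."""
    rev = {}
    for j, outgoing in arcs.items():
        for label, weight, j_next in outgoing:
            rev.setdefault(j_next, []).append((label, weight, j))
    return rev


def midpoint(x, arcs, initial, final, cost=levenshtein_cost):
    """Return ((i0, j0), d(x, A)) with i0 = floor(|x| / 2)."""
    i0 = len(x) // 2
    forward = level_distances(x[:i0], arcs, initial, cost)
    backward = level_distances(x[i0:][::-1], reverse_arcs(arcs), final, cost)
    common = [j for j in forward if j in backward]
    j0 = min(common, key=lambda j: forward[j] + backward[j])
    return (i0, j0), forward[j0] + backward[j0]
```

The second component of the result equals $d(x, A)$, which gives a simple consistency check against any direct edit-distance computation.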

The algorithm proceeds recursively by first determining the midpoint of the
optimal alignment. At step 0 of the recursion, we first find the midpoint $(i_0, j_0)$
between $x$ and $A$. Let $x^1$ and $x^2$ be such that $x = x^1 x^2$ and $|x^1| = i_0$, and let
$A^1$ and $A^2$ be the automata obtained from $A$ by respectively changing the final
state to $j_0$ in $A^1$ and the initial state to $j_0$ in $A^2$. We can now recursively find
the alignment between $x^1$ and $A^1$ and between $x^2$ and $A^2$.

Theorem 1. An optimal alignment between a string $x$ and an automaton $A$ can
be computed in linear space $O(|x| + |A|)$ and in time $O(|x||A|)$ if $A$ is acyclic,
$O(|x||A| \log |x| \log |A|_Q)$ otherwise.

Proof. We can assume without loss of generality that the length of $x$ is a
power of 2. At step $k$ of the recursion, we need to compute the midpoints
for $2^k$ string-automaton pairs $(x_k^i, A_k^i)_{1 \leq i \leq 2^k}$. Thus, the complexity of step $k$
is in

$$O\Big(\sum_{i=1}^{2^k} |x_k^i| |A_k^i| \log |A_k^i|_Q\Big) = O\Big(\frac{|x|}{2^k} \sum_{i=1}^{2^k} |A_k^i| \log |A_k^i|_Q\Big)$$

since $|x_k^i| = |x|/2^k$ for all $i$. When $A$ is acyclic, the log factor can be avoided and the equality
$\sum_{i=1}^{2^k} |A_k^i| = |A|$ holds, thus the time complexity of step $k$ is in $O(|x||A|/2^k)$.
In the general case, each $|A_k^i|$ can be in the order of $|A|$, thus the complexity of
step $k$ is in $O(|x||A| \log |A|_Q)$.

Since there are at most $\log |x|$ steps in the recursion, this leads to an overall
time complexity in $O(|x||A|)$ if $A$ is acyclic and $O(|x||A| \log |A|_Q \log |x|)$ in
general. □
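The final step of the proof sums the per-step costs over the recursion depth. As a worked illustration, with constants hidden in the $O$-notation and assuming, as in the proof, that $|x|$ is a power of 2, the two sums read:

```latex
\begin{align*}
\text{acyclic } A:\quad
  & \sum_{k=0}^{\log_2 |x|} \frac{|x|\,|A|}{2^k}
    \;\le\; 2\,|x|\,|A| \;=\; O(|x|\,|A|), \\
\text{general } A:\quad
  & \sum_{k=0}^{\log_2 |x|} |x|\,|A| \log |A|_Q
    \;=\; O(|x|\,|A| \log |A|_Q \log |x|).
\end{align*}
```

In the acyclic case the per-step cost halves at each level, so the geometric series is dominated by its first term; in the general case every step may cost as much as the first one, which is where the extra $\log |x|$ factor comes from.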

When the loop-connectedness of $A$ is $k$, then the time complexity can be
improved to $O(k|x||A| \log |x|)$ in the general case.



5 Conclusion 



We presented general algorithms for computing in linear space both the edit- 
distance between a string and a finite automaton and their optimal alignment. 
Our algorithms are conceptually simple and make use of existing generic algo- 
rithms. Our results further provide a better understanding of previous algorithms 
for more restricted automata by relating them to shortest-distance algorithms 
and general queue disciplines. 